Introduction to Programming

R: Data Analysis (data visualization)

Hugo Lhuillier

Master in Economics, Sciences Po

Packages

Tidyverse

  • A package: a collection of functions, data, and documentation that extends the capabilities of base R
install.packages("tidyverse") # install a package 
library(tidyverse)            # load the library / package
  • tidyverse: a collection of packages that work very well together

Data visualization

ggplot2

  • Can seem complicated at first
  • Quickly generates very detailed plots
  • Practical question:

Do cars with big engines use more fuel than cars with small engines?

  • Will answer using the mpg data frame (displ: engine size in litres, hwy: fuel efficiency on the highway in miles per gallon)
head(mpg)
## # A tibble: 6 x 11
##   manufacturer model displ  year   cyl trans drv     cty   hwy fl    class
##   <chr>        <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi         a4     1.80  1999     4 auto… f        18    29 p     comp…
## 2 audi         a4     1.80  1999     4 manu… f        21    29 p     comp…
## 3 audi         a4     2.00  2008     4 manu… f        20    31 p     comp…
## 4 audi         a4     2.00  2008     4 auto… f        21    30 p     comp…
## 5 audi         a4     2.80  1999     6 auto… f        16    26 p     comp…
## 6 audi         a4     2.80  1999     6 manu… f        18    26 p     comp…

ggplot2: default template

  • Every ggplot2 plot follows the same template
ggplot(data = <DATA>) + <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))
  • ggplot() creates a coordinate system that you can add layers to
  • data: the dataset to be used in the graph
  • GEOM_FUNCTION: adds a layer to the plot
  • mapping and aes(): define how variables in your dataset are mapped to visual properties
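
A minimal sketch of the template in action, assuming ggplot2 is attached directly (library(tidyverse) attaches it as well). The names p_empty, p_points and pts are illustrative only. ggplot() alone gives an empty plot, the geom function adds the layer that draws the data, and layer_data() returns what ggplot2 computed for that layer.

```r
library(ggplot2)  # assumption: attached directly; library(tidyverse) also works

# ggplot() alone: a coordinate system, no layer yet
p_empty <- ggplot(data = mpg)

# adding a geom function adds a layer that draws the data
p_points <- p_empty + geom_point(mapping = aes(x = displ, y = hwy))

# data computed by ggplot2 for the first (and only) layer
pts <- layer_data(p_points, 1)
head(pts[, c("x", "y")])
```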

Scatter plots

Scatter plot: basics

  • To plot hwy against displ
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy))

  • Outliers? Could be hybrids?

Scatter plot: adding another variable

  • Add a third variable to a 2d scatterplot by mapping it to an aesthetic
  • Aesthetic: visual property of the objects in the plot
  • To map an aesthetic to a variable, associate the name of the aesthetic to the name of the variable inside aes()
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy, color = class))

  • Can try with size, alpha or shape
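
A possible sketch of two of these aesthetics, with illustrative object names. shape is mapped to drv rather than class here, because the default shape scale handles at most six values and class has seven; alpha is mapped to the continuous variable cty.

```r
library(ggplot2)

# drv is discrete with three values, so each value gets its own shape
p_shape <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, shape = drv))

# alpha mapped to a continuous variable: higher cty, more opaque points
p_alpha <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, alpha = cty))

# shapes actually assigned to the points
shapes_used <- unique(layer_data(p_shape)$shape)
```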

Scatter plot: general aesthetic properties

  • Can also set the aesthetic properties of your geom manually
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy), color = "blue")

  • Can specify

    1. color (the name of a color, as a string)
    2. size (in mm)
    3. shape (a number between 0 and 25)
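
A short sketch of the difference between setting and mapping an aesthetic (object names are illustrative). Outside aes(), color = "blue" is applied as-is. Inside aes(), "blue" is treated as a variable with a single value, so ggplot2 picks a color from its default scale and adds a legend.

```r
library(ggplot2)

# set: color is an argument of the geom, so every point is blue
p_set <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")

# mapped: "blue" is read as a one-value variable, not as a color name
p_mapped <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))

# colors ggplot2 ends up using in each case
set_colours    <- unique(layer_data(p_set)$colour)
mapped_colours <- unique(layer_data(p_mapped)$colour)
```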

Scatter plot: exercise

  1. In mpg, identify which variables are categorical and which are continuous
  2. Map a continuous variable to color inside aes
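
One possible starting point for the first question, as a sketch only: the storage type of each column gives a first approximation. Numeric columns are candidates for continuous aesthetics, although year and cyl are stored as integers and take only a few distinct values. The names is_num, continuous_vars and categorical_vars are illustrative.

```r
library(ggplot2)  # provides the mpg data frame

# TRUE for columns stored as numbers
is_num <- sapply(mpg, is.numeric)

continuous_vars  <- names(mpg)[is_num]
categorical_vars <- names(mpg)[!is_num]

continuous_vars
categorical_vars
```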

Scatter plot: exercise (solution)

ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy, color = cty))

Scatter plot: exercise (solution)

ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy, color = year))

Facets & grids

Facets

  • Can also split the plot into facets (=subplots), via facet_wrap()
  • First argument: a formula, i.e. ~ followed by a variable name; the variable must be discrete (but not necessarily categorical)
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy)) + 
  facet_wrap(~ class, nrow = 2)
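
A sketch of faceting on a variable that is discrete without being categorical: cyl is stored as an integer but takes only a few distinct values. The names p_facets and n_panels are illustrative.

```r
library(ggplot2)

# one subplot per distinct value of cyl, all on a single row
p_facets <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_wrap(~ cyl, nrow = 1)

# number of panels that receive data
n_panels <- length(unique(layer_data(p_facets)$PANEL))
```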

Grids

  • Before: split with respect to one variable
  • Can also split with two variables with facet_grid()
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy)) + 
  facet_grid(drv ~ cyl)
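
In the formula, the left-hand side defines the rows and the right-hand side the columns. A sketch, with illustrative object names, of using . as a placeholder to facet along one dimension only:

```r
library(ggplot2)

p_base <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))

# drv defines the rows, no column variable
p_rows <- p_base + facet_grid(drv ~ .)

# cyl defines the columns, no row variable
p_cols <- p_base + facet_grid(. ~ cyl)

n_row_panels <- length(unique(layer_data(p_rows)$PANEL))
n_col_panels <- length(unique(layer_data(p_cols)$PANEL))
```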

Geometrical objects

Geometrical objects

  • ggplot2 is only about adding layers to a plot
  • A geom: geometrical object used by ggplot2 to represent data
  • Ex: plot hwy against displ, this time using geom_smooth instead of geom_point

Geometrical objects

ggplot(data = mpg) + geom_smooth(mapping = aes(x = displ, y = hwy))
## `geom_smooth()` using method = 'loess'

Geometrical objects

  • Some aesthetics work only for some geoms (e.g. shape has no meaning for a line)
  • In total, about 30 geoms provided by ggplot2

Geometrical objects

  • For some geoms (e.g. smooth), can use the group aesthetic
  • group: when mapped to a discrete variable, ggplot2 draws one object per group, without adding a legend or distinguishing features
ggplot(data = mpg) + geom_smooth(mapping = aes(x = displ, y = hwy, group = drv))
## `geom_smooth()` using method = 'loess'
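
A sketch comparing group with an aesthetic that also distinguishes the groups visually (object names are illustrative). method and formula are spelled out here only to avoid the message printed by geom_smooth(); the smoother is the same loess as above.

```r
library(ggplot2)

# group: one smooth per value of drv, all drawn alike, no legend
p_group <- ggplot(data = mpg) +
  geom_smooth(mapping = aes(x = displ, y = hwy, group = drv),
              method = "loess", formula = y ~ x)

# color: same grouping, but each line gets its own color and a legend
p_color <- ggplot(data = mpg) +
  geom_smooth(mapping = aes(x = displ, y = hwy, color = drv),
              method = "loess", formula = y ~ x)

smooth_group <- layer_data(p_group)
smooth_color <- layer_data(p_color)
```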

Multiple geoms

  • Of course, can add multiple layers to the same plot
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  geom_smooth(mapping = aes(x = displ, y = hwy))
## `geom_smooth()` using method = 'loess'
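
The line starting with ## is a message, not an error: geom_smooth() reports which smoother it picked. A sketch, assuming a recent ggplot2, of the same plot with method and formula given explicitly so that no message is printed; p_quiet and build_status are illustrative names.

```r
library(ggplot2)

# same loess smoother, but chosen explicitly
p_quiet <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  geom_smooth(mapping = aes(x = displ, y = hwy),
              method = "loess", formula = y ~ x)

# build the plot and record whether any message was emitted
build_status <- tryCatch(
  { ggplot_build(p_quiet); "no message" },
  message = function(m) "message"
)
```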

Multiple geoms

  • In the previous command, the same mapping is written twice
geom_point(mapping = aes(x = displ, y = hwy)) + geom_smooth(mapping = aes(x = displ, y = hwy))
  • Instead, can define a global mapping directly in ggplot()
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + geom_point() + geom_smooth()

Multiple geoms

  • If you place mappings in a geom function in addition to a global mapping, ggplot2 will extend or overwrite the global mappings for that layer only
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point(mapping = aes(color = class)) + 
  geom_smooth()
## `geom_smooth()` using method = 'loess'

Multiple geoms

  • Can also specify data for each layer
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point(mapping = aes(color = class)) + 
  geom_smooth(data = filter(mpg, class == "subcompact"), se = FALSE)
## `geom_smooth()` using method = 'loess'

Geoms: exercises

  • Reproduce these plots
## `geom_smooth()` using method = 'loess'
## `geom_smooth()` using method = 'loess'

Geoms: solutions

# first plot
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) + 
  geom_point() + 
  geom_smooth(se = FALSE)
# second plot 
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point(mapping = aes(color = drv)) + 
  geom_smooth(mapping = aes(linetype = drv), se = FALSE)

Statistical transformations

Bar charts

  • Based on the diamonds dataset, with variables such as price, carat, color, clarity, cut
  • Bar chart with geom_bar
  • Note: only one variable in the mapping
ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut))

Bar charts

  • On the y-axis: the number of diamonds in each category
  • count is the default stat associated with geom_bar; other computed variables, such as proportions, can be displayed instead
ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut, y = ..prop.., group = 1))

  • ggplot2 provides over 20 stats (e.g. stat_smooth is the stat associated with geom_smooth)
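
Since each geom has a default stat, the same bar chart can be written from the stat side. A sketch, with illustrative object names, checking that geom_bar() and stat_count() compute the same counts as a plain table():

```r
library(ggplot2)

# geom side: geom_bar() uses stat_count() by default
p_geom <- ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut))

# stat side: stat_count() draws bars by default
p_stat <- ggplot(data = diamonds) + stat_count(mapping = aes(x = cut))

geom_counts   <- layer_data(p_geom)$count
stat_counts   <- layer_data(p_stat)$count
manual_counts <- as.vector(table(diamonds$cut))
```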

Bar charts: exercise

  • Run the cell below and work out what geom_col does
ggplot(data = diamonds) + geom_col(mapping = aes(x = cut, y = carat))
  • Run the cell below and understand why we need group
ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut, y = ..prop..))
ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut, y = ..prop.., group = "x"))
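
A sketch of one way to check what group changes, by looking at the computed proportions rather than at the plot. It assumes ggplot2 3.3.0 or later and uses after_stat(prop), the newer spelling of ..prop..; object names are illustrative.

```r
library(ggplot2)

# no group: each cut is its own group, so each proportion is computed within one bar
p_no_group <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = after_stat(prop)))

# group = 1: all bars belong to one group, proportions are relative to the whole
p_one_group <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = after_stat(prop), group = 1))

prop_no_group  <- layer_data(p_no_group)$prop
prop_one_group <- layer_data(p_one_group)$prop
```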

Position adjustments

Bar charts (cont.)

  • As with scatter plots, can add another variable to a bar chart
ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut, fill = cut))
ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut, fill = clarity))

Bar charts & position adjustment

  • By default, the bars are stacked
  • The stacking is performed by the position adjustment, specified by the position argument

    1. position = "identity": overlaps the bars
    2. position = "fill": works like stacking, but makes each set of stacked bars the same height
    3. position = "dodge": places overlapping objects beside one another

Bar charts & position adjustment

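g <- ggplot(data = diamonds, mapping = aes(x = cut, fill = clarity))
# overlapping bars, made visible with transparency
g + geom_bar(alpha = 1/5, position = "identity")
# overlapping bars, outlines only (color mapped so the bars stay visible)
g + geom_bar(mapping = aes(color = clarity), fill = NA, position = "identity")
# stacked bars rescaled to the same height
g + geom_bar(position = "fill")
# bars placed side by side
g + geom_bar(position = "dodge")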
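
A sketch of what position = "fill" does to the data, with illustrative object names: for every cut, the top of the stack is rescaled to 1, which is why all bars have the same height.

```r
library(ggplot2)

g_fill <- ggplot(data = diamonds, mapping = aes(x = cut, fill = clarity)) +
  geom_bar(position = "fill")

fill_data <- layer_data(g_fill)

# top of the stack for each cut; as.numeric() turns the discrete x into 1, 2, ...
stack_tops <- tapply(fill_data$ymax, as.numeric(fill_data$x), max)
```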

Position adjustments: exercise

  • Using geom_boxplot, play with the position argument. Specifically, based on the mpg data frame, do a boxplot of hwy on class, differentiating by drv.

    1. What is the default position?
    2. Which of fill, dodge and identity work?

Position adjustments: solution

g <- ggplot(data = mpg, mapping = aes(x = class, y = hwy, fill = drv))
g + geom_boxplot()
g + geom_boxplot(position = "identity")
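
For the first question, one way to look up a default is to inspect the layer that a geom function creates. This is a sketch that assumes the layer object exposes its position adjustment as $position, which holds in current ggplot2 versions but is an internal detail rather than a documented interface; the variable names are illustrative.

```r
library(ggplot2)

# class of the position adjustment stored in a freshly created layer
boxplot_position <- class(geom_boxplot()$position)[1]
bar_position     <- class(geom_bar()$position)[1]

boxplot_position
bar_position
```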

Coordinate systems

Coordinate systems

  • Default coordinate system is the Cartesian coordinate system
  • Others:

    • coord_flip(): switches the x and y axes
    • coord_polar(): switches to polar coordinates

Coordinate systems

# switches x and y
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) + geom_boxplot() + coord_flip()
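
An alternative sketch, assuming ggplot2 3.3.0 or later: the same horizontal boxplot can be obtained without coord_flip() by mapping the discrete variable to y directly. The names p_horizontal and box_data are illustrative.

```r
library(ggplot2)

# class on the y axis: the boxes are drawn horizontally
p_horizontal <- ggplot(data = mpg, mapping = aes(x = hwy, y = class)) +
  geom_boxplot()

# one row of boxplot statistics per class
box_data <- layer_data(p_horizontal)
```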

Coordinate systems

# polar coordinates
ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = cut), show.legend = FALSE) + 
  labs(x = NULL, y = NULL) + 
  coord_polar()